Goal of the project

This part is a baseline model to do Topic Modelling. In this project, the goal is to fit an LDA model to the news articles and analyze and interpret the results intuitively. Here are the main elements of the analysis:

  • Corpus: Newspaper articles from Reuters. (Source: Kaggle )
  • Document: Single articles from the Corpus.
  • Topic: To be determined by the models. Note: there are five pre-determined categories in the database Inlandsnachrichten, Politik, Weltnachrichten, Top-Nachrichten, Breakingviews).

First to preprocess the data by cleaning it. Once the data is cleaned (see first section for the details of the cleaning implementation), some descriptive statistics are offered. In particular, the number of distinct words and the most frequent words appearing overall and by category.

In the subsequent section, we use publicly available lexicons to attribute “sentiments” to the words of our Corpus. We then analyse the amount of positive and negative sentiments in the Corpus.

Finally, we fit an LDA model and analyze what every Topic contains.

Main source: Text Mining in R.

NB: The second project (‘Text Mining’) analyzes the differences we do not simply take into account the stopwords but we use instead a frequency based approach. The second project also includes bigrams instead of single words.

1. Preprocessing the data

First we have to clean and preprocess the data. The following steps are made to do so:

  1. Keep only the articles with at least approx. 250 words (arbitrary number).
  2. Select a random sample of size 1’000.
  3. Remove all the punctuations, the possessive forms (‘s), the word ’reuter’.

First we import the data.

## -- Import data --
data = read_csv("news_en.csv")
documents = data[c(1,3,4)]

Then we do preliminary cleaning. In particular, keeping only the articles with at least 250 words (arbitrary number, approximation). Then we randomly select a subsample of size \(N\).

# Keep only the articles for which there are at least (approx.) 250 words.
documents = documents %>% 
  mutate(words = sapply(strsplit(Body, " "), length)) %>% 
  filter(words >= 250) %>% 
  select(-words)

# From this we select only a random subsamble of size 1000
set.seed(1234)
N = 1000 # number of articles to select
index = seq(1, dim(documents)[1], by = 1)
index_sample = c(sort(sample(index, N, replace = FALSE)))

# Select the random articles 
documents = documents[index_sample, ]

# Load the list of stop words
data("stop_words")
articles = data_frame()
for(i in 1:N){
  # Get the article
  doc = documents$Body[i]
  # Turns it to a string, remove numbers and put each sentence into a line
  # Each sentence is assumed to be separated by ". " 
  text = toString(doc)  %>% 
    removeNumbers() %>% 
    strsplit(". ", fixed = TRUE) %>% 
    unlist() %>% 
    tolower() 
  # Remove Punctuations
  text = gsub("[,]", "", text)
  text = gsub("[.]", "", text)
  text = gsub("[!]", "", text)
  text = gsub("[?]", "", text)
  text = gsub("[:]", "", text)
  text = gsub("’'s", "", text, fixed = TRUE)
  text = gsub("reuters", "", text, fixed = TRUE)
  # Turn the dataFrame into a data_frame (tipple), necessary for further parts
  # Also removes the stop words
  text_df <- data_frame(line = 1:length(text), text = text) %>% 
    unnest_tokens(word, text) %>% 
    anti_join(stop_words) 
  
  # Add information about the id (article number) and category, for descriptive stats
  text_df$id = rep(documents$id[i], dim(text_df)[1])
  text_df$Kat = rep(documents$Kat[i], dim(text_df)[1])
  
  # Put it together to get the articles
  articles = rbind(articles, text_df)
}
head(articles, 5)
## # A tibble: 5 x 4
##    line word       id Kat            
##   <int> <chr>   <int> <chr>          
## 1     1 federal    11 Top-Nachrichten
## 2     1 judge      11 Top-Nachrichten
## 3     1 tuesday    11 Top-Nachrichten
## 4     1 denied     11 Top-Nachrichten
## 5     1 inc’s      11 Top-Nachrichten

Now we have a good data set.

2. Descriptive Stats

The descriptive statistics concern the whole Corpus.

# Number of distinct words. 
nwords = length(unique(articles$word))
nwords
## [1] 27042
# Number of "documents" 
ndocs = length(unique(articles$id))
ndocs
## [1] 1000
# Number of articles / proportion by category
articles %>% 
  distinct(id, Kat) %>% 
  group_by(Kat) %>% 
  summarize(
    n = n(), 
    proportion = n/ndocs
  ) %>% 
  ungroup()
## # A tibble: 5 x 3
##   Kat                    n proportion
##   <chr>              <int>      <dbl>
## 1 Breakingviews          1      0.001
## 2 Inlandsnachrichten   139      0.139
## 3 Politik              212      0.212
## 4 Top-Nachrichten      272      0.272
## 5 Weltnachrichten      376      0.376
# Average number of words by category
# Note that it doesn't have the stop words anymore and that the first 
# filter to have at least 250 words is an approx.
articles %>% 
  group_by(Kat, id) %>% 
  summarize(
    n = n()
  ) %>% 
  ungroup() %>% 
  group_by(Kat) %>% 
  summarize(
    averageWords = round(mean(n,1))
  )
## # A tibble: 5 x 2
##   Kat                averageWords
##   <chr>                     <dbl>
## 1 Breakingviews               236
## 2 Inlandsnachrichten          245
## 3 Politik                     294
## 4 Top-Nachrichten             264
## 5 Weltnachrichten             280

And now some basic graphical representations:

# For the two graphs, be careful to adjust the filter(n > NUMBER) to get a reasonable graph
# Most frequent words
articles %>%
  count(word, sort = TRUE) %>%
  top_n(10, wt = n) %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_bar(stat = "identity") +
  xlab(NULL) +
  coord_flip() + 
  theme_bw() + 
  ggtitle("10 Most Frequent Words (overall)")

# Find the top 10 words 
top10 = articles %>% 
  count(word, sort = TRUE) %>% 
  top_n(10)

# Get the positions ot be the same as the graph before
positions = rev(top10$word)

# Most frequent words by category
articles %>% 
  group_by(Kat) %>% 
  count(word, sort = TRUE) %>% 
  filter(word %in% top10$word) %>%
  mutate(word = reorder(word, n)) %>%
  ungroup() %>% 
  ggplot(aes(reorder(word, n), n, fill = Kat)) +
  geom_bar(stat = "identity") +
  xlab(NULL) +
  scale_x_discrete(limits = positions)+
  coord_flip() + 
  theme_bw() +
  ggtitle("Most frequent words by category")

3. Adding Sentiments

For simplicity, we only keep words belonging to categories “positive” and “negative”. We start by comparing two different lexicons.

# Compare different lexicon
wordsSentiment = c()
lexicons = c("nrc", "bing")
i = 0
for(lex in lexicons){
  i = i + 1
  SA = get_sentiments(lex) %>% 
    filter(sentiment == "positive" | sentiment == "negative")
  
  # Count the number of words found by the lexicon 
  sent_data = articles %>% 
    inner_join(SA) 
  # Proportion of unique words comparison
  wordsSentiment[i] = length(unique(sent_data$word)) / length(unique(articles$word))
}
print(wordsSentiment)
## [1] 0.1146735 0.1033208
# Get the most represented lexicon
pos = which(wordsSentiment == max(wordsSentiment))
SA = get_sentiments(lexicons[pos]) %>% 
  filter(sentiment == "positive" | sentiment == "negative")

Note that changing the lexicon has a huge impact on the results. Indeed, “Trump” is not recognised in “nrc”, but it’s a very frequent word overall. Therefore, it changes the rest of the analysis.

Here are some descriptive tables.

# Tables
articles %>%
  inner_join(SA) %>%
  count(sentiment,sort = TRUE)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 positive  32702
## 2 negative  22232
articles %>%
  inner_join(SA) %>%
  count(word, sentiment,sort = TRUE)
## # A tibble: 3,151 x 3
##    word       sentiment     n
##    <chr>      <chr>     <int>
##  1 president  positive   1380
##  2 united     positive   1027
##  3 government negative    985
##  4 police     positive    663
##  5 including  positive    562
##  6 deal       positive    501
##  7 white      positive    501
##  8 foreign    negative    486
##  9 statement  positive    440
## 10 war        negative    396
## # ... with 3,141 more rows

Some wordclouds.

# Basic Wordcloud
articles %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

# Adding sentiments
articles %>%
  inner_join(
    get_sentiments("nrc") %>% filter(sentiment == "positive" | sentiment == "negative")
  ) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("red", "darkgreen"),
                   max.words = 100)

4. Latent Dirichlet Allocation

Now we try to fit the model. First we have to turn our data_frame object into a DocumentTermMatrix object, to which we will then fit the LDA model.

4.1. DocumentTermMatrix object

First, we have to turn our data into a DocumentTermMatrix object, to be able to fit the data to an lda model. In between, we add again the sentiments.

# Get a DocumentTermMatrix object
new_data = articles[c(3,2)]
colnames(new_data) = c("document", "term")
# Here is the required format
new_data = new_data %>% 
  group_by(document, term) %>% 
  mutate(count = n()) %>% 
  distinct() %>% 
  ungroup()

head(new_data, 5)
## # A tibble: 5 x 3
##   document term    count
##      <int> <chr>   <int>
## 1       11 federal     1
## 2       11 judge       2
## 3       11 tuesday     1
## 4       11 denied      1
## 5       11 inc’s       1
# Here is the transformation to get the DocumentTermMatrix object
dtm = new_data %>% 
  cast_dtm(document, term, count)
dtm
## <<DocumentTermMatrix (documents: 1000, terms: 27042)>>
## Non-/sparse entries: 202807/26839193
## Sparsity           : 99%
## Maximal term length: 27
## Weighting          : term frequency (tf)
inspect(dtm[1:4, 1:18])
## <<DocumentTermMatrix (documents: 4, terms: 18)>>
## Non-/sparse entries: 32/40
## Sparsity           : 56%
## Maximal term length: 14
## Weighting          : term frequency (tf)
## Sample             :
##     Terms
## Docs denied department donald federal house judge justice president trump
##   11      1          3      1       1     1     2       5         2     3
##   18      0          0      1       1     4     0       1         2     0
##   28      1          0      0       0     0     0       0         3     0
##   48      0          0      1       1     1     0       0         2     3
##     Terms
## Docs tuesday
##   11       1
##   18       0
##   28       3
##   48       0

4.2. Sentiment Contribution

Again, we add the positive and negative sentiment to the tidied object.

# Tidy the DocumentTermMatrix Object
dtm_tidy = tidy(dtm)
dtm_tidy
## # A tibble: 202,807 x 3
##    document term    count
##    <chr>    <chr>   <dbl>
##  1 11       federal     1
##  2 18       federal     1
##  3 48       federal     1
##  4 75       federal     2
##  5 106      federal     1
##  6 120      federal     1
##  7 447      federal     3
##  8 556      federal     2
##  9 581      federal     4
## 10 597      federal     2
## # ... with 202,797 more rows
# This will create a new tibble with the "terms" (= words) indicating in which 
# document they appear (document = article), and how many times each word appear by document. 

# Add the sentiments
dtm_sentiment <- dtm_tidy %>%
  inner_join(
    get_sentiments("nrc") %>% filter(sentiment == "positive" | sentiment == "negative"), 
    by = c(term = "word")
  )

head(dtm_sentiment,5)
## # A tibble: 5 x 4
##   document term   count sentiment
##   <chr>    <chr>  <dbl> <chr>    
## 1 11       denied     1 negative 
## 2 28       denied     1 negative 
## 3 355      denied     1 negative 
## 4 702      denied     1 negative 
## 5 997      denied     3 negative

Thanks to this step, we can say what are the estimated 5 more negative and positive articles, and print the headlines. Checking quickly the headlines allows us to assess the quality of the sentiment.

# Top 5 articles most negative sentiments -> with headlines
# Increase the id of the documents + 1 (because I did so previously in the code)
data$id = as.character(data$id + 1)
colnames(data)[1] = "document"
worst5 = dtm_sentiment %>%
  count(document, sentiment, wt = count) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment) %>% 
  top_n(-5, wt = sentiment) %>% 
  inner_join(data[,c(1,2,4,5)], by ="document") %>% 
  arrange(sentiment)

# Top 5 articles most positive sentiments
top5 = dtm_sentiment %>%
  count(document, sentiment, wt = count) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment) %>% 
  top_n(5, wt = sentiment) %>% 
  inner_join(data[,c(1,2,4,5)], by ="document") %>% 
  arrange(desc(sentiment))

# Print the headlines 
print("Headlines of most negative articles:")
## [1] "Headlines of most negative articles:"
for(i in 1:5){
  print(paste(as.character(i), ".", worst5$Headline[i]))
}
## [1] "1 . Intel hit with 32 lawsuits over security flaws"
## [1] "2 . Pope appeals to Syria's Assad to respect humanitarian law, let aid in"
## [1] "3 . Oil up 3 percent as Brexit chances dim; gasoline surges too"
## [1] "4 . Iran says we do not want nuclear weapons, there is no sunset clause in nuclear deal"
## [1] "5 . Turkey-EU deal aims to discourage illegal migration, Davutoglu says"
print("Headlines of most positive articles:")
## [1] "Headlines of most positive articles:"
for(i in 1:5){
  print(paste(as.character(i), ".", top5$Headline[i]))
}
## [1] "1 . Cuba's Fidel Castro made revolutionary mark on history"
## [1] "2 . Syria opposition's Hijab says need general ceasefire, talks at dead end"
## [1] "3 . White House adviser says return to Florida Keys may take weeks"
## [1] "4 . Britain ready to take other steps against Russian provocation: May's spokesman"
## [1] "5 . 'The Power of Love': address by U.S. bishop at Harry and Meghan's wedding"
# Contribution to sentiments
dtm_sentiment %>%
  count(sentiment, term, wt = count) %>%
  filter(n >= N/5) %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(term = reorder(term, n)) %>%
  ggplot(aes(term, n, fill = sentiment)) +
  geom_bar(stat = "identity") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab("Contribution to sentiment") 

4.3. Fitting the model

With 5 topics

# k^ is the number of topics. We can have more but 2 topics is easier to get a representation 
# and to understand what's going on
ap_lda <- LDA(dtm, k = 5, control = list(seed = 1234))
ap_lda
## A LDA_VEM topic model with 5 topics.
# Most likely word by topic
get_terms(ap_lda)
##    Topic 1    Topic 2    Topic 3    Topic 4    Topic 5 
##  "percent"   "people" "minister"    "trump"    "china"
# Most liekly topic by document
# get_topics(ap_lda, 1)
hist(get_topics(ap_lda,1), main = "Histogram of Topics")

ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 135,210 x 3
##    topic term        beta
##    <int> <chr>      <dbl>
##  1     1 federal 1.23e- 3
##  2     2 federal 3.28e- 4
##  3     3 federal 2.42e- 4
##  4     4 federal 3.40e- 3
##  5     5 federal 9.30e- 4
##  6     1 judge   4.45e- 4
##  7     2 judge   8.64e- 5
##  8     3 judge   4.81e-36
##  9     4 judge   2.17e- 3
## 10     5 judge   2.02e- 5
## # ... with 135,200 more rows
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Most frequent words by topics
ap_top_terms %>%
  # mutate(term = reorder(term, beta)) %>% 
  ggplot(aes(x = reorder(term,beta), y = reorder(beta, term), fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  theme_bw() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Some information about the \(\beta\) of the model.

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(
    topic1 > .001 | topic2 > .001 | topic3 > .001 | topic4 > .001 | topic5 > .001
    ) %>%
  mutate(log_ratio = log2(topic2 / topic1)) %>% 
  arrange(log_ratio)

# beta_spread

# Show words at the limit between two topics ? 
# Only interpretable for K  = 2 (two topics)  ? 
beta_spread %>%
  select(log_ratio, term) %>% 
  filter(abs(log_ratio) < 0.25) %>% 
  arrange(log_ratio) %>% 
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) + 
  geom_bar(stat = "identity") + 
  coord_flip()

We can also show the relationship between Topics and Categories

# GRAPH = Relationship between Category and Topic
# Source: https://www.datacamp.com/community/tutorials/ML-NLP-lyric-analysis#buildingmodels
category = documents[c(1,3)]
colnames(category) = c("document", "Kat")
category$document = as.character(category$document)

source_topic_relationship <- tidy(ap_lda, matrix = "gamma") %>%
  #join to the tidy form to get the genre field
  inner_join(category, by = "document") %>%
  select(Kat, topic, gamma) %>%
  group_by(Kat, topic) %>%
  #avg gamma (document) probability per genre/topic
  mutate(mean = mean(gamma)) %>%
  select(Kat, topic, mean) %>%
  ungroup() %>%
  #re-label topics
  mutate(topic = paste("Topic", topic, sep = " ")) %>%
  distinct()

circos.clear() #very important! Reset the circular layout parameters
#this is the long form of grid.col just to show you what I'm doing
#you can also assign the genre names individual colors as well
grid.col = c("Topic 1" = "grey", "Topic 2" = "grey", "Topic 3" = "grey",
             "Topic 4" = "grey", "Topic 5" = "grey")

#set the gap size between top and bottom halves set gap size to 15
circos.par(gap.after = c(rep(5, length(unique(source_topic_relationship[[1]])) - 1), 15,
                         rep(5, length(unique(source_topic_relationship[[2]])) - 1), 15))
chordDiagram(source_topic_relationship,  grid.col = grid.col, annotationTrack = "grid",
             preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(source_topic_relationship))))))
#go back to the first track and customize sector labels
#use niceFacing to pivot the label names to be perpendicular
circos.track(track.index = 1, panel.fun = function(x, y) {
  circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
              facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA) # here set bg.border to NA is important
title("Relationship Between Topic and Categories")

With 8 topics

# k is the number of topics
ap_lda <- LDA(dtm, k = 8, control = list(seed = 1234))
ap_lda
## A LDA_VEM topic model with 8 topics.
# Most likely word by topic
get_terms(ap_lda)
##    Topic 1    Topic 2    Topic 3    Topic 4    Topic 5    Topic 6 
##  "company"   "people" "minister"    "trump"    "china"  "percent" 
##    Topic 7    Topic 8 
##  "percent"    "syria"
# Most liekly topic by document
# get_topics(ap_lda, 1)
hist(get_topics(ap_lda,1), main = "Histogram of Topics")

ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 216,336 x 3
##    topic term            beta
##    <int> <chr>          <dbl>
##  1     1 federal 0.00102     
##  2     2 federal 0.000446    
##  3     3 federal 0.000280    
##  4     4 federal 0.00378     
##  5     5 federal 0.0000000685
##  6     6 federal 0.000820    
##  7     7 federal 0.00250     
##  8     8 federal 0.0000974   
##  9     1 judge   0.000473    
## 10     2 judge   0.000131    
## # ... with 216,326 more rows
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Most frequent words by topics
ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%  ggplot(aes(x = reorder(term, beta), y = beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  theme_bw() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Some information about the \(\beta\) of the model.

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(
    topic1 > .001 | topic2 > .001 | topic3 > .001 | topic4 > .001 | topic5 > .001 | topic6 > .001 | topic7 > .001 | topic8 > .001
    ) %>%
  mutate(log_ratio = log2(topic2 / topic1)) %>% 
  arrange(log_ratio)

# beta_spread

# Show words at the limit between two topics ? 
# Only interpretable for K  = 2 (two topics)  ? 
beta_spread %>%
  select(log_ratio, term) %>% 
  filter(abs(log_ratio) < 0.25) %>% 
  arrange(log_ratio) %>% 
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) + 
  geom_bar(stat = "identity") + 
  coord_flip()

We can also show the relationship between Topics and Categories

# GRAPH = Relationship between Category and Topic
# Source: https://www.datacamp.com/community/tutorials/ML-NLP-lyric-analysis#buildingmodels
category = documents[c(1,3)]
colnames(category) = c("document", "Kat")
category$document = as.character(category$document)

source_topic_relationship <- tidy(ap_lda, matrix = "gamma") %>%
  #join to the tidy form to get the genre field
  inner_join(category, by = "document") %>%
  select(Kat, topic, gamma) %>%
  group_by(Kat, topic) %>%
  #avg gamma (document) probability per genre/topic
  mutate(mean = mean(gamma)) %>%
  select(Kat, topic, mean) %>%
  ungroup() %>%
  #re-label topics
  mutate(topic = paste("Topic", topic, sep = " ")) %>%
  distinct()

circos.clear() #very important! Reset the circular layout parameters
#this is the long form of grid.col just to show you what I'm doing
#you can also assign the genre names individual colors as well
grid.col = c("Topic 1" = "grey", "Topic 2" = "grey", "Topic 3" = "grey",
             "Topic 4" = "grey", "Topic 5" = "grey", "Topic 6" = "grey",
             "Topic 7" = "grey", "Topic 8" = "grey")

#set the gap size between top and bottom halves set gap size to 15
circos.par(gap.after = c(rep(5, length(unique(source_topic_relationship[[1]])) - 1), 15,
                         rep(5, length(unique(source_topic_relationship[[2]])) - 1), 15))
chordDiagram(source_topic_relationship,  grid.col = grid.col, annotationTrack = "grid",
             preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(source_topic_relationship))))))
#go back to the first track and customize sector labels
#use niceFacing to pivot the label names to be perpendicular
circos.track(track.index = 1, panel.fun = function(x, y) {
  circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
              facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA) # here set bg.border to NA is important
title("Relationship Between Topic and Categories")

With 12 topics

# k^ is the number of topics. We can have more but 2 topics is easier to get a representation 
# and to understand what's going on
ap_lda <- LDA(dtm, k = 12, control = list(seed = 1234))
ap_lda
## A LDA_VEM topic model with 12 topics.
# Most likely word by topic
get_terms(ap_lda)
##    Topic 1    Topic 2    Topic 3    Topic 4    Topic 5    Topic 6 
##  "company"   "israel"    "trade"    "trump"    "china" "european" 
##    Topic 7    Topic 8    Topic 9   Topic 10   Topic 11   Topic 12 
##   "police"   "syrian"  "percent"    "court"  "islamic"  "percent"
# Most liekly topic by document
# get_topics(ap_lda, 1)
hist(get_topics(ap_lda,1), main = "Histogram of Topics")

ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
## # A tibble: 324,504 x 3
##    topic term        beta
##    <int> <chr>      <dbl>
##  1     1 federal 1.79e- 3
##  2     2 federal 7.14e- 5
##  3     3 federal 4.50e- 5
##  4     4 federal 3.77e- 3
##  5     5 federal 3.03e-20
##  6     6 federal 1.63e-10
##  7     7 federal 3.23e- 3
##  8     8 federal 1.26e- 5
##  9     9 federal 1.34e- 3
## 10    10 federal 1.12e- 3
## # ... with 324,494 more rows
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Most frequent words by topics
ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%  ggplot(aes(x = reorder(term, beta), y = beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  theme_bw() +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Some information about the \(\beta\) of the model.

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(
    topic1 > .001 | topic2 > .001 | topic3 > .001 | topic4 > .001 | topic5 > .001 | topic6 > .001 | topic7 > .001 | topic8 > .001 | topic9 > .001 | topic10 > .001 | topic11 > .001 | topic12 > 0.01
    ) %>%
  mutate(log_ratio = log2(topic2 / topic1)) %>% 
  arrange(log_ratio)

# beta_spread

# Show words at the limit between two topics ? 
# Only interpretable for K  = 2 (two topics)  ? 
beta_spread %>%
  select(log_ratio, term) %>% 
  filter(abs(log_ratio) < 0.25) %>% 
  arrange(log_ratio) %>% 
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) + 
  geom_bar(stat = "identity") + 
  coord_flip()

We can also show the relationship between Topics and Categories

# GRAPH = Relationship between Category and Topic
# Source: https://www.datacamp.com/community/tutorials/ML-NLP-lyric-analysis#buildingmodels
category = documents[c(1,3)]
colnames(category) = c("document", "Kat")
category$document = as.character(category$document)

source_topic_relationship <- tidy(ap_lda, matrix = "gamma") %>%
  #join to the tidy form to get the genre field
  inner_join(category, by = "document") %>%
  select(Kat, topic, gamma) %>%
  group_by(Kat, topic) %>%
  #avg gamma (document) probability per genre/topic
  mutate(mean = mean(gamma)) %>%
  select(Kat, topic, mean) %>%
  ungroup() %>%
  #re-label topics
  mutate(topic = paste("Topic", topic, sep = " ")) %>%
  distinct()

circos.clear() #very important! Reset the circular layout parameters
#this is the long form of grid.col just to show you what I'm doing
#you can also assign the genre names individual colors as well
grid.col = c("Topic 1" = "grey", "Topic 2" = "grey", "Topic 3" = "grey",
             "Topic 4" = "grey", "Topic 5" = "grey", "Topic 6" = "grey",
             "Topic 7" = "grey", "Topic 8" = "grey", "Topic 9" = "grey", "Topic 10" = "grey", "Topic 11" = "grey", "Topic 12" = "grey")

#set the gap size between top and bottom halves set gap size to 15
circos.par(gap.after = c(rep(5, length(unique(source_topic_relationship[[1]])) - 1), 15,
                         rep(5, length(unique(source_topic_relationship[[2]])) - 1), 15))
chordDiagram(source_topic_relationship,  grid.col = grid.col, annotationTrack = "grid",
             preAllocateTracks = list(track.height = max(strwidth(unlist(dimnames(source_topic_relationship))))))
#go back to the first track and customize sector labels
#use niceFacing to pivot the label names to be perpendicular
circos.track(track.index = 1, panel.fun = function(x, y) {
  circos.text(CELL_META$xcenter, CELL_META$ylim[1], CELL_META$sector.index,
              facing = "clockwise", niceFacing = TRUE, adj = c(0, 0.5))
}, bg.border = NA) # here set bg.border to NA is important
title("Relationship Between Topic and Categories")

Improvements

Now we try two different approaches to account for the shortcomings of the previous part.

1. Frequency

colnames(data)[1] = "id"
data$id = as.character(data$id)
articles$id = as.character(articles$id)
dat <- articles %>% 
  left_join(data[, c("id", "Date")]) %>% 
  select(-line)

total_words <- dat %>% 
  group_by(Kat) %>% 
  count(word, sort = T) %>% 
  mutate(total = sum(n)) %>% 
  mutate(fraction = n/total)

## Plot the words frequency
ggplot(total_words, aes(fraction, fill = Kat)) +
  geom_histogram(show.legend = FALSE) +
  xlim(NA, 0.0009) +
  facet_wrap(~Kat, ncol = 2, scales = "free_y")

# Term Frequency
freq_by_rank <- total_words %>% 
  group_by(Kat) %>% 
  distinct() %>% 
  arrange(desc(fraction)) %>% 
  mutate(rank = row_number())

# Exponential decay
freq_by_rank %>% 
  ggplot(aes(rank,fraction, color = Kat)) + 
  geom_line(size = 1, alpha = 0.8, show.legend = FALSE) + 
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Frequency by rank") +
  theme_bw()

# Calculate and Bind the term frequency
dat <- total_words %>%
  bind_tf_idf(word, Kat, n) 

dat3 <- dat  %>%
  group_by(Kat) %>% 
  distinct(tf_idf, .keep_all = T) %>% 
  top_n(n = 10, wt = tf_idf) 

dat3 %>%
  ggplot(aes(reorder(word, tf_idf), tf_idf, fill = Kat)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~Kat, ncol = 2, scales = "free") +
  coord_flip()

2. Bigrams

dat2 <- data[index_sample,] 
  
dat2 <- dat2 %>% 
  unnest_tokens(bigram, Body, token = "ngrams", n = 2) %>% 
  select(-Headline) 

dat2$bigram <- gsub("[^A-Za-z ]","",dat2$bigram)

dat2 <- dat2 %>%
  separate(bigram, c("word1", "word2"), sep = " ")

dat2 <- dat2 %>% 
  filter(!word1 == "",
         !word2 == "")

dat3 <- dat2 %>%
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word)

##reunite the bigrams after having removed stop words
bigrams_united <- dat3 %>%
  unite(bigram, word1, word2, sep = " ") 

total_words <- bigrams_united %>% 
  group_by(Kat) %>% 
  count(bigram, sort = T) %>% 
  mutate(total = sum(n)) %>% 
  mutate(fraction = n/total)

dat3 <- total_words %>%
  bind_tf_idf(bigram, Kat, n) 

## Plot bigram words
dat3  %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(bigram, levels = rev(unique(bigram)))) %>% 
  group_by(Kat) %>% 
  distinct(tf_idf, .keep_all = T) %>% 
  top_n(n = 10, wt = tf_idf) %>%
  ungroup() %>% 
  ggplot(aes(reorder(word, tf_idf), tf_idf, fill = Kat)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~Kat, ncol = 2, scales = "free") +
  coord_flip() 

Check association of words in the bigram:

dat3 %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "gun") %>%
  count(word2, sort = TRUE)
## # A tibble: 27 x 3
## # Groups:   Kat [3]
##    Kat                word2            nn
##    <chr>              <chr>         <int>
##  1 Inlandsnachrichten control           1
##  2 Inlandsnachrichten demonstrators     1
##  3 Inlandsnachrichten enthusiasts       1
##  4 Inlandsnachrichten exports           1
##  5 Inlandsnachrichten king              1
##  6 Inlandsnachrichten laws              1
##  7 Inlandsnachrichten lobby             1
##  8 Inlandsnachrichten maker             1
##  9 Inlandsnachrichten makers            1
## 10 Inlandsnachrichten manufacturers     1
## # ... with 17 more rows

Issues sentiment Analysis of negation:

negation_words <- c("not", "no", "never", "without")

AFINN <- get_sentiments("afinn")

dat4 <- dat2 %>%
  filter(word1 %in% negation_words) %>%
  filter(!word2 %in% stop_words$word) %>% 
  inner_join(AFINN, by = c(word2 = "word")) %>%
  count(word1, word2, score, sort = TRUE)

dat4 %>%
  mutate(contribution = n * score) %>%
  arrange(desc(abs(contribution))) %>%
  head(20) %>%
  mutate(word2 = reorder(word2, contribution)) %>%
  ggplot(aes(word2, n * score, fill = n * score > 0)) +
  geom_col(show.legend = FALSE) +
  xlab(paste0("Words preceded by negation words")) +
  ylab("Sentiment score * number of occurrences") +
  coord_flip()

Plots sentiment over the first 1’000 articles.

dat <- data[1:1000,] %>%  unnest_tokens(word, Body)  

data(stop_words)

dat <- dat %>% anti_join(stop_words, by = "word") %>%
  select(-Headline)

dat$word <- gsub("[^A-Za-z]","",dat$word)

dat <- dat %>% 
  filter(!word == "")

dat$Kat <- as.factor(dat$Kat) 

sentim <- dat %>%
  filter(!word %in% negation_words) %>%
  inner_join(AFINN, by = c(word = "word")) %>%
  count(word, score, sort = TRUE) 

sentim <- left_join(dat, sentim, by ="word") %>% 
  filter(!score == "NA") 

total_sentim <- sentim %>% 
  group_by(Kat) %>% 
  mutate(total = sum(score)) %>% 
  mutate(fraction = n/total)


total_sentim <- total_sentim %>% 
  separate(Date, c("year", "month", "day"), sep = "-") %>% 
  select(-c("month", "day"))


pirateplot(formula =  score ~ Kat + year, 
   data = total_sentim, 
   xlab = NULL, ylab = "Sentiment score", 
   main = "Sentiment score distribution by newspaper and month", 
   pal = "google", 
   point.o = .2, 
   avg.line.o = 1, 
   theme = 0, 
   point.pch = 16, 
   point.cex = 1.5, 
   jitter.val = .1, 
   cex.lab = .9, cex.names = .7) 

Plot bigrams in networks maps.

dat5 <- total_words %>%
  bind_tf_idf(bigram, Kat, n)

bigram_graph <- dat5  %>% 
  filter(n > 45)  %>% 
  graph_from_data_frame()

bigram_graph
## IGRAPH 7ff17bb DN-- 30 31 -- 
## + attr: name (v/c), n (e/n), total (e/n), fraction (e/n), tf
## | (e/n), idf (e/n), tf_idf (e/n)
## + edges from 7ff17bb (vertex names):
##  [1] Weltnachrichten->north korea      Politik        ->white house     
##  [3] Weltnachrichten->prime minister   Politik        ->donald trump    
##  [5] Politik        ->president donald Weltnachrichten->told reuters    
##  [7] Weltnachrichten->north korean     Weltnachrichten->united nations  
##  [9] Weltnachrichten->foreign minister Weltnachrichten->north koreas    
## [11] Weltnachrichten->president donald Weltnachrichten->donald trump    
## [13] Weltnachrichten->security council Weltnachrichten->south korea     
## + ... omitted several edges
set.seed(2017)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  # xlim(-5,7)+
  # ylim(1,10)+
  xlab(NULL)+
  ylab(NULL)+
  theme(axis.text.x=element_blank(),
      axis.text.y=element_blank())+
  theme_bw()